Detect Medicare Fraud in the US through Classification Model

Medical fraud is a major contributor to inflating expenses and waste in the US healthcare system. Forbes identifies the industry as among the most vulnerable to fraud & cyber-attacks. This investigation explores physician-level Medicare and Medicaid data through exploratory data analysis and a classfication model, hoping to show how big data can be used to prevent this problem.

The inspiration for this project came from my coleague at IBM, whose mother was a victim of medical ID theft. Although the problem of ID theft is slightly different (would require transctional data), CMS and LEIE data are among the best publicly available sources for healthcare and thus were used as a proof-of-concept. It reaffirms that, similar to credit card fraud detection, fraudulent practices in healthcare can be prevented and alleviated. The most vulnerable demographics, dependent elders of 65 years and older, would benefit the most from this technology.

Healthcare fraud in the news:

Data sources:

In cleaning and combining the two datasets, I followed closely previous literature, which are both listed in the "References" section.

In [2]:
import os
import pandas as pd
In [3]:
# Set up data directory
CWD = os.getcwd()
cms_data_dir = os.path.join(CWD, 'CMSData')
In [4]:
# Some years columns are capitalized and other years the columns are lowercase:
capitalization_dict = {
    '2012': str.upper,
    '2013': str.upper,
    '2014': str.lower,
    '2015': str.lower,
    '2016': str.upper,
    '2017': str.lower,
}

1. CMS Part B dataset

In [5]:
# Set dtypes
partB_dtypes = {
    'npi': 'str',
    'nppes_provider_last_org_name': 'str',
    'nppes_provider_first_name': 'str',
    'nppes_provider_mi': 'str',
    'nppes_credentials': 'str',
    'nppes_provider_gender': 'str',
    'nppes_entity_code': 'str',
    'nppes_provider_street1': 'str',
    'nppes_provider_street2': 'str',
    'nppes_provider_city': 'str',
    'nppes_provider_zip': 'str',
    'nppes_provider_state': 'str',
    'nppes_provider_country': 'str',
    'provider_type': 'str',
    'medicare_participation_indicator': 'str',
    'place_of_service': 'str',
    'hcpcs_code': 'str',
    'hcpcs_description': 'str',
    'hcpcs_drug_indicator': 'str',
    'line_srvc_cnt': 'float64',
    'bene_unique_cnt': 'float64',    
    'bene_day_srvc_cnt': 'float64',
    'average_medicare_allowed_amt': 'float64',
    'average_submitted_chrg_amt': 'float64',
    'average_medicare_payment_amt': 'float64',
    'average_medicare_standard_amt': 'float64',
}
In [6]:
# Get dfs for all years - take a few minutes to run
years = ['2012','2013','2015','2016']
dfs   = []

for year in years:
    file = os.path.join(cms_data_dir, f'cms{year}.txt')
    dtypes = dict(zip(list(map(capitalization_dict[year], partB_dtypes.keys())), list(partB_dtypes.values()))) #get correct column capitalization and dtype
    df = pd.read_csv(file, delimiter='\t', dtype=dtypes)
    df.columns = map(str.lower, df.columns)  # make all variable names lowercase
    df['year'] = year #add Year column 
    dfs.append(df)
In [7]:
# Concatenate
partB_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
partB_df.shape
Out[7]:
(37653939, 30)
In [8]:
# Remove rows corresponding to drugs because LINE_SRVC_CNT for them is not a desirable count
partB_df = partB_df[(partB_df['hcpcs_drug_indicator'] == 'N')]
partB_df.shape
Out[8]:
(35368294, 30)
In [9]:
# Drop missing NPI and HCPCS
# This means dropping 2014 and 2016 - both did not have HCPCS Code
partB_df = partB_df.dropna(subset = ['npi','hcpcs_code'])
partB_df.shape
Out[9]:
(35368294, 30)
In [10]:
# Keep variables based on "Medicare fraud detection using neural networks" (Johnson, Khoshgoftaar 2019)
partB_variables_to_keep = [
    'npi',
    'provider_type',
    'nppes_provider_city', # keep
    'nppes_provider_zip', # keep
    'nppes_provider_state', # keep
    'nppes_provider_country', # keep
    'hcpcs_code',  # not in paper but kept
    'hcpcs_description',  # not in paper but kept
    'hcpcs_drug_indicator',  # not in paper but kept
    'place_of_service',  # not in paper but kept
    'nppes_provider_gender',
    'line_srvc_cnt',
    'bene_unique_cnt',
    'bene_day_srvc_cnt',
    'average_submitted_chrg_amt',
    'average_medicare_payment_amt',
    'year' # need Year for labeling
]
partB_df = partB_df[partB_variables_to_keep]
In [11]:
partB_df.head()
Out[11]:
npi provider_type nppes_provider_city nppes_provider_zip nppes_provider_state nppes_provider_country hcpcs_code hcpcs_description hcpcs_drug_indicator place_of_service nppes_provider_gender line_srvc_cnt bene_unique_cnt bene_day_srvc_cnt average_submitted_chrg_amt average_medicare_payment_amt year
1 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99222 Initial hospital inpatient care, typically 50 ... N F M 115.0 112.0 115.0 199.0 108.115652 2012
2 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99223 Initial hospital inpatient care, typically 70 ... N F M 93.0 88.0 93.0 291.0 158.870000 2012
3 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99231 Subsequent hospital inpatient care, typically ... N F M 111.0 83.0 111.0 58.0 30.720721 2012
4 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99232 Subsequent hospital inpatient care, typically ... N F M 544.0 295.0 544.0 105.0 56.655662 2012
5 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99233 Subsequent hospital inpatient care, typically ... N F M 75.0 55.0 75.0 150.0 81.390000 2012
In [12]:
partB_df.loc[partB_df['npi'] == '1003000142'][['npi',
                                             'provider_type',
                                             'place_of_service',
                                             'line_srvc_cnt',
                                             'average_submitted_chrg_amt',
                                             'year']][:5]
Out[12]:
npi provider_type place_of_service line_srvc_cnt average_submitted_chrg_amt year
16 1003000142 Anesthesiology O 28.0 216.571429 2012
17 1003000142 Anesthesiology O 24.0 111.000000 2012
9153288 1003000142 Anesthesiology F 56.0 483.000000 2013
9153289 1003000142 Anesthesiology F 16.0 1105.000000 2013
9153290 1003000142 Anesthesiology F 22.0 709.136364 2013
In [13]:
partB_df['year'].value_counts()
Out[13]:
2016    9104337
2015    8904316
2013    8732934
2012    8626707
Name: year, dtype: int64
In [14]:
# Write all combined CMS to csv
#partB_df.to_csv('combined-partB-data-v2')

2. LEIE Dataset

In [15]:
leie_data_dir = os.path.join(CWD, 'LEIEData')
In [16]:
leie_dtypes = {
    'LASTNAME': 'str',
    'FIRSTNAME': 'str',
    'MIDNAME': 'str',
    'BUSNAME' : 'str',
    'GENERAL': 'str',
    'SPECIALTY': 'str',
    'UPIN': 'str',
    'NPI': 'str',
    'DOB': 'str',
    'ADDRESS': 'str',
    'CITY': 'str',
    'STATE': 'str',
    'ZIP': 'str',
    'EXCLTYPE': 'str',
    'EXCLDATE': 'int64',
    'REINDATE': 'int64',
    'WAIVERDATE': 'int64',
    'WVRSTATE': 'str',
}
In [17]:
#LEIE data is monthly between 01/2018 (1801) - 12/2019 (1912)
year_months = ['1801','1802','1803','1804','1805','1806','1807','1808','1809','1810','1811','1812',
            '1901','1902','1903','1904','1905','1906','1907','1908','1909','1910','1911','1912']
dfs = []

for year_month in year_months:
    file = os.path.join(leie_data_dir, f'leie{year_month}-excl.csv')
    df   = pd.read_csv(file, dtype=leie_dtypes)
    df.columns = map(str.lower, df.columns)
    dfs.append(df)
In [18]:
# Concatenate
leie_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
leie_df.shape
Out[18]:
(4983, 18)
In [19]:
leie_df.head()
Out[19]:
lastname firstname midname busname general specialty upin npi dob address city state zip excltype excldate reindate waiverdate wvrstate
0 NaN NaN CHANGING STEPS TREATMENT CENTE OTHER BUSINESS COMMUNITY HLTH CTR ( NaN 1477704351 NaN 14540 HAMLIN ST, STE B VAN NUYS CA 91411 1128a1 20180220 0 0 NaN
1 NaN NaN OLIVE TREE FOSTER HOME OTHER BUSINESS ADULT HOME NaN 0000000000 NaN 94-245 PUPUKOAE STREET HONOLULU HI 96797 1128a1 20180220 0 0 NaN
2 AIRHART LAURA PAULINE NaN IND- LIC HC SERV PRO NURSE/NURSES AIDE NaN 0000000000 19770704 7304 FULLER CIRCLE FORT WORTH TX 76133 1128b4 20180220 0 0 NaN
3 ALBERT AMY NaN IND- LIC HC SERV PRO PHYSICIAN'S ASSISTAN NaN 1679639397 19770818 1124 GAINSBORO ROAD LOWER MERION PA 19004 1128b4 20180220 0 0 NaN
4 ALLEN HEATHER ANITRA NaN IND- LIC HC SERV PRO NURSE/NURSES AIDE NaN 0000000000 19740326 1004 BINGHAM AVE ROWAN IA 50470 1128a3 20180220 0 0 NaN
In [20]:
# Drop NPI = 0, which means missing - A LOT ARE MISSING, which is a problem for the data
leie_df = leie_df[leie_df['npi'] != 0]
leie_df.shape
Out[20]:
(4983, 18)
In [21]:
# Keep exclusions most related to Fraud
exclusions_to_keep = [
    '1128a1',
    '1128a2',
    '1128a3',
    '1128b4',
    '1128b7',
    '1128c3Gi',
    '1128c3gii',
]
leie_df = leie_df[leie_df['excltype'].isin(exclusions_to_keep)]
leie_df.shape
Out[21]:
(4294, 18)
In [22]:
leie_df['excltype'].value_counts()
Out[22]:
1128a1    1988
1128b4    1308
1128a3     523
1128a2     425
1128b7      50
Name: excltype, dtype: int64

Type definitions of physician fraud:

  • 1128a1: Conviction of program-related crimes
  • 1128a2: Conviction of relating to patient abuse or neglect
  • 1128a3: Felony conviction relating to healthcare fraud
  • 1128a4: License revocation, suspension, or surrender
  • 1128a3: Fraud, kickbacks, other prohibited activities
In [23]:
# Write all combined LEIE to csv
#partB_df.to_csv('combined-leie-data')

3. Combine/Label Data

In [24]:
from datetime import datetime, timedelta
import numpy as np
In [25]:
# Convert to datetime
leie_df['excldate'] = pd.to_datetime(leie_df['excldate'], format='%Y%m%d', errors ='ignore')
In [26]:
# Round excl date to the nearest year Johnson & Khoshgoftaar (2019)
def round_to_year(dt=None):
    year = dt.year
    month = dt.month
    if month >= 6:
        year = year + 1
    return datetime(year=year,month=1,day=1)

leie_df['excl_year'] = leie_df.excldate.apply(lambda x: round_to_year(x))
In [27]:
# Make exclusion dict 
# 1215053665 has 2 exclusions, so sort also df to get latest year
excl_year_dict = dict([npi, year] for npi, year in zip(leie_df.sort_values(by='excl_year').npi, leie_df.sort_values(by='excl_year').excl_year))
In [28]:
# Get label as 0 or 1
partB_df['excl_year'] = partB_df['npi'].map(excl_year_dict)
partB_df['excl_year'] = partB_df['excl_year'].fillna(datetime(year=1900,month=1,day=1)) # fill NaN, physicians without exclusion, with year 1900

partB_df['year'] = pd.to_datetime(partB_df['year'].astype(str), format='%Y', errors ='ignore')
partB_df['fraudulent'] = np.where(partB_df['year'] < partB_df['excl_year'], 1, 0) # compare year vs. exclusion year to get Fraudulent
In [29]:
print("partB_df is our combined dataset with shape: {0}".format(partB_df.shape))
partB_df is our combined dataset with shape: (35368294, 19)

4. Draw Visualizations

In [30]:
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.figure_factory as ff
import plotly.graph_objects  as go
from plotly.subplots import make_subplots
In [31]:
# Get number and amount of fraudulent services
partB_df['fraud_line_srvc_cnt'] = partB_df['line_srvc_cnt']*partB_df['fraudulent']
partB_df['fraud_average_submitted_chrg_amt'] = partB_df['average_submitted_chrg_amt']*partB_df['fraudulent']
partB_df['fraud_average_medicare_payment_amt'] = partB_df['average_medicare_payment_amt']*partB_df['fraudulent']
In [32]:
# Aggregate by state
state_df = partB_df.groupby('nppes_provider_state').agg({
    'line_srvc_cnt':[('total_services_count','sum')],
    'fraud_line_srvc_cnt':[('total_fraud_services_count','sum')]
}).reset_index()

# Drop multi-index
state_df.columns = ['_'.join(col) for col in state_df.columns]
state_df.columns = ['provider_state', 'total_services_count', 'fraud_services_count']
In [33]:
# Get % fraud
state_df['fraud_services_pct'] = state_df['fraud_services_count']/state_df['total_services_count']
state_df.head()
Out[33]:
provider_state total_services_count fraud_services_count fraud_services_pct
0 AA 26875.0 0.0 0.000000
1 AE 45118.0 0.0 0.000000
2 AK 7096804.1 0.0 0.000000
3 AL 145179172.2 27972.0 0.000193
4 AP 35607.0 0.0 0.000000
In [34]:
np.seterr(divide = 'ignore') 

fig = go.Figure(data=go.Choropleth(
    locations=state_df['provider_state'],
    z = np.log(state_df['fraud_services_pct'].astype(float))+0.000001, #log-scale
    locationmode = 'USA-states',
    colorscale = 'Reds',
    colorbar_title = "Logged %",
    marker_line_color='white'
))

fig.update_layout(
    title_text = 'Logged Percentage of Medicare Fraudulent Service by US State (2012-2016)',
    geo_scope='usa',
)

fig.show()

Heat map shows a high concentration of fraudulent Medicare claims around the Northeast and South of the US, with Texas leading all states by a wide margin.

In [35]:
# Get dummy variables for gender
partB_df2 = pd.concat([partB_df, pd.get_dummies(partB_df['nppes_provider_gender'])], axis=1)
In [36]:
# Aggregate by provider type
type_df = partB_df2.groupby('provider_type').agg({
    'line_srvc_cnt':['sum'],
    'fraud_line_srvc_cnt':['sum'],
    'M':['sum'],
    'F':['sum'],
    'average_submitted_chrg_amt':['median'], #since distribution skewed right
    'average_medicare_payment_amt':['median'],
    'fraud_average_submitted_chrg_amt':['max'],
    'fraud_average_medicare_payment_amt':['max']
}).reset_index()

# Drop multi-index
type_df.columns = ['_'.join(col) for col in type_df.columns]
type_df.columns = ['provider_type', 'total_services_count', 'fraud_services_count','male_count','female_count',
                   'avg_submitted_chrg_amt','avg_medicare_payment_amt', 'fraud_avg_submitted_chrg_amt','fraud_avg_medicare_payment_amt']
In [37]:
# Sorting
type_df = type_df.sort_values('fraud_services_count',ascending=False)[:15] #get top 15 fraudulent types
type_df = type_df.sort_values('total_services_count',ascending=True).reset_index(drop=True) #re-sort by total services

# Add some fields
type_df['non_fraud_services_count'] = type_df['total_services_count'] - type_df['fraud_services_count']
type_df['fraud_services_pct'] = (type_df['fraud_services_count']/type_df['total_services_count'])*100
type_df.head()
Out[37]:
provider_type total_services_count fraud_services_count male_count female_count avg_submitted_chrg_amt avg_medicare_payment_amt fraud_avg_submitted_chrg_amt fraud_avg_medicare_payment_amt non_fraud_services_count fraud_services_pct
0 Geriatric Medicine 10738563.3 52334.0 42616.0 34393.0 155.594615 62.206552 800.000000 178.185878 10686229.3 0.487346
1 Pain Management 14888967.2 96178.0 90837.0 11511.0 300.000000 68.560030 1980.000000 288.390714 14792789.2 0.645968
2 General Practice 26465791.9 114533.0 150038.0 33452.0 94.000000 41.020000 603.641975 201.138500 26351258.9 0.432759
3 Anesthesiology 44806223.5 189943.0 733834.0 155457.0 718.158333 85.868923 2722.142857 547.420385 44616280.5 0.423921
4 Neurology 54613299.3 108023.0 398365.0 126310.0 221.935065 77.823478 2634.000000 604.060909 54505276.3 0.197796
In [38]:
# Get 2015 data only for speed
partB_2015_df = partB_df2[partB_df2['year'] == datetime(year=2015,month=1,day=1)]
top_7_types  = type_df['provider_type'][:7].tolist()

for p_type in top_7_types:
    
    #Eliminate 0 payments then log
    x = np.log(partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.average_submitted_chrg_amt!=0)]['average_submitted_chrg_amt'])
    fraud_x = np.log(partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.fraud_average_submitted_chrg_amt!=0)]['fraud_average_submitted_chrg_amt'])
    
    fig,ax = plt.subplots(figsize=(12,5))
    sns.distplot(x,label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x,label='Fraudulent', hist=False, rug=False)

    ax.set(
        title  ='Distribution of Submitted Charge Amount - Fraud vs. Non-fraud - '+ p_type,
        xlabel = 'Log of USD Payment Amount',
        ylabel = 'Count'
          )
    
plt.show()

As shown in histogram comparisons, no obvious evidence to say that the average fraudulent submitted charge is more expensive than non-fraudulent charge across the Top 15 Fraudulent categories.

This does not support the hypothesis I previously had - perhaps because doctors are pretty good at hiding wrongful claims and not making it obvious by listing unreasonably high costs.

In [39]:
for p_type in top_7_types:
    #Eliminate 0 payments
    x = partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.average_medicare_payment_amt!=0)]['average_medicare_payment_amt']
    fraud_x = partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.fraud_average_medicare_payment_amt!=0)]['fraud_average_medicare_payment_amt']
    
    fig,ax = plt.subplots(figsize=(12,5))
    sns.distplot(x,label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x,label='Fraudulent', hist=False, rug=False)
    ax.set(
        title  ='Distribution of Medicare Payment Amount - Fraud vs. Non-fraud - '+ p_type,
        xlabel = 'Log of USD Payment Amount',
        ylabel = 'Count'
          )
    plt.show()

Same conclusion can be made here about Medicare Payment amount.

In [40]:
partB_2015_df = partB_df[partB_df['year'] == datetime(year=2015,month=1,day=1)]

fig, ax = plt.subplots(figsize=(15,7))

sns.heatmap(partB_2015_df.corr(method='pearson'), annot=True, fmt='.4f', 
            cmap=plt.get_cmap('coolwarm'), cbar=False, robust=True)
ax.set_yticklabels(ax.get_yticklabels(), rotation="horizontal")
ax.set(
    title  ='Correlation Heatmap in 2015 (only Continuous Variables)',

      )
plt.show()

No insights are particularly interesting here, outside of the correlations we would normally expect.

In [41]:
# Get categorical variables (except location)
for col in ['nppes_provider_gender','provider_type']:
    partB_2015_df = pd.concat([partB_2015_df, pd.get_dummies(partB_2015_df[col], drop_first= True)], axis=1)
    partB_2015_df = partB_2015_df.drop(col, 1)
In [42]:
def get_redundant_pairs(df):
    '''Get duplicate pairs to drop in correlation matrix after unstacking'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_correlations(df):
    '''Get biggest correlations'''
    au_corr = df.corr().unstack() #unstack
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop)
    return au_corr

#get relevant columns
cols = partB_2015_df.columns[20:].tolist() + ['line_srvc_cnt','bene_unique_cnt','average_submitted_chrg_amt','fraudulent']
corr = get_correlations(partB_2015_df[cols])
In [43]:
corr = corr.to_frame().reset_index() #to frame
corr.columns = ['Variable A', 'Variable B', 'Correlation']

def color_df(value):
    '''Color red if positive, green if negative'''
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color

corr = corr.reindex(corr['Correlation'].abs().sort_values(ascending=False).index).reset_index(drop=True)
corr[:15].style.applymap(color_df, subset=['Correlation'])
Out[43]:
Variable A Variable B Correlation
0 line_srvc_cnt bene_unique_cnt 0.514602
1 Ambulatory Surgical Center average_submitted_chrg_amt 0.282647
2 M Nurse Practitioner -0.270379
3 M Clinical Laboratory -0.179634
4 Diagnostic Radiology Internal Medicine -0.146244
5 Diagnostic Radiology Family Practice -0.128436
6 M Diagnostic Radiology 0.123625
7 Clinical Laboratory bene_unique_cnt 0.121128
8 Family Practice Internal Medicine -0.119511
9 M Physician Assistant -0.117444
10 M Ambulatory Surgical Center -0.117421
11 M Orthopedic Surgery 0.107368
12 Anesthesiology average_submitted_chrg_amt 0.106800
13 M Mass Immunization Roster Biller -0.105435
14 M Cardiology 0.103208

Again, too many interesting conclusions can be made here, apart from the fact that some medical fields/types are dominated by certain gender. The very top, Line Service Count (number of services performed) and Unique Beneficiary Count (number of distinct beneficiaries), make sense as having a positive correlation over 0.5.

Most notable is that Ambulatory Surgical Center and Anesthesiology are quite "expensive" and very positively correlated to Submitted Charge Amount -- an important insight as they are among the high flyers of physician fraud.

In [44]:
fig = go.Figure()

col_layout_dict = {'non_fraud_services_count': ['Non-fraud Service','rgba(50, 171, 96, 0.6)'],
                 'fraud_services_count': ['Fraud Service','rgb(255, 0, 0)']} #dict for layout

for col in ['non_fraud_services_count','fraud_services_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))

fig.update_layout(
    barmode = 'stack',
    title = 'Top 15 Fraudulent Medicare Category by Service Count',
    paper_bgcolor='white',
    plot_bgcolor='white',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
        domain=[0, 0.95],
    ),
    xaxis=dict(
        zeroline=False,
        showline=False,
        showticklabels=True,
        showgrid=True,
        domain=[0, 0.95],
    ),
    xaxis_title_text='Count',
)

annotations = [] #annotate with %

x   = type_df['fraud_services_count']+type_df['non_fraud_services_count']+100000000
y_p = np.round(type_df['fraud_services_pct'].tolist(), decimals=2)

for y_p, x, y in zip(y_p,x,type_df['provider_type']):
    annotations.append(dict(xref='x1', yref='y1',
                            y=y, x=x,
                            text=str(y_p) + '%',
                            font=dict(family='Arial', size=12,
                                      color='rgb(255, 0, 0)'),
                            showarrow=False))

annotations.append(dict(xref='paper', yref='paper',
                        x=-0.2, y=-0.209,
                        text='Combined CMS and LEIE data' +
                             'to label the leading Fraudulent physician categories (15 Feb 2020)',
                        font=dict(family='Arial', size=10, color='rgb(150,150,150)'),
                        showarrow=False))

fig.update_layout(annotations=annotations)

fig.show()

By our definition, Anesthesiology, General Practice, Pain Management, and Getriatric Medicine have a high percentage of fraudulent services performed by excluded physicians.

Particularly, 65% of Pain Management cases were wrongfully performed (an overinflated statistic as we qualified ALL previous services performed by an excluded physician as fraudulent if they were on the list - the reality is unlikely for ALL). Regardless, it makes sense that the high flyers would be General Practice, Pain Mangement, and Geriatric Medicine (the elderlies are the most vulnerable).

In [45]:
fig = go.Figure()

#dict for layout
col_layout_dict = {'female_count': ['Female Physicians ','#ffcdd2'],
                 'male_count': ['Male Physicians','#A2D5F2']}

for col in ['female_count','male_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))

fig.update_layout(
    barmode = 'stack',
    title = 'Top 15 Fraudulent Medicare Category by Service Count and Gender',
    paper_bgcolor='white',
    plot_bgcolor='white',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
        domain=[0, 0.90],
    ),
    xaxis=dict(
        zeroline=False,
        showline=False,
        showticklabels=True,
        showgrid=True,
        domain=[0, 0.90],
    ),
    xaxis_title_text='Count',
)

fig.show()

The top fraudulent fields/categories are dominated by Male physicians.

Clinical Laboratory and Ambulance Service Provider has a lot of line services but require fewer doctors - so they did not have enough quantity to show on the bar plot?

5. Prep Data for Training

  • Random undersample data with 50:50 class distributions (reduce the majority class to minority until 1:1)
  • One hot encoding
In [46]:
partB_df.columns
Out[46]:
Index(['npi', 'provider_type', 'nppes_provider_city', 'nppes_provider_zip',
       'nppes_provider_state', 'nppes_provider_country', 'hcpcs_code',
       'hcpcs_description', 'hcpcs_drug_indicator', 'place_of_service',
       'nppes_provider_gender', 'line_srvc_cnt', 'bene_unique_cnt',
       'bene_day_srvc_cnt', 'average_submitted_chrg_amt',
       'average_medicare_payment_amt', 'year', 'excl_year', 'fraudulent',
       'fraud_line_srvc_cnt', 'fraud_average_submitted_chrg_amt',
       'fraud_average_medicare_payment_amt'],
      dtype='object')
In [47]:
#drop variables that we are not using
partB_train_df = partB_df.drop(columns=[
    'fraud_average_medicare_payment_amt', #for visual only
    'fraud_average_submitted_chrg_amt', #for visual only
    'fraud_line_srvc_cnt', #for visual only
    'year',
    'excl_year',
    'hcpcs_drug_indicator',
    'place_of_service',
    'nppes_provider_city',
    'nppes_provider_country',
    'hcpcs_description',
    'nppes_provider_zip'
])
In [48]:
a = partB_train_df.fraudulent.value_counts()
display(a)
print('Fraudulent physicians are: {0}% of all data'.format(str(np.round((a[1]/a[0])*100,decimals = 6)))) 
0    35359427
1        8867
Name: fraudulent, dtype: int64
Fraudulent physicians are: 0.025077% of all data
In [49]:
display(partB_train_df.head())
print("We have 9 features, including 4 categorical variables in \n {0}".format(partB_train_df.columns[1:5].tolist()))
npi provider_type nppes_provider_state hcpcs_code nppes_provider_gender line_srvc_cnt bene_unique_cnt bene_day_srvc_cnt average_submitted_chrg_amt average_medicare_payment_amt fraudulent
1 1003000126 Internal Medicine MD 99222 M 115.0 112.0 115.0 199.0 108.115652 0
2 1003000126 Internal Medicine MD 99223 M 93.0 88.0 93.0 291.0 158.870000 0
3 1003000126 Internal Medicine MD 99231 M 111.0 83.0 111.0 58.0 30.720721 0
4 1003000126 Internal Medicine MD 99232 M 544.0 295.0 544.0 105.0 56.655662 0
5 1003000126 Internal Medicine MD 99233 M 75.0 55.0 75.0 150.0 81.390000 0
We have 9 features, including 4 categorical variables in 
 ['provider_type', 'nppes_provider_state', 'hcpcs_code', 'nppes_provider_gender']
In [50]:
def undersample(df, category, split=0.5):
    '''Random undersampling based on a category and class split (% of dataset belonged to majority class)'''
    fraud_number = df[category].sum()
    non_fraud_indices = df[df[category] == 0].index #get indices of non-fraud
    random_indices = np.random.choice(non_fraud_indices, size=int(round(fraud_number*(split/(1-split)), 0)), 
                                      replace=False) #sample at random without replacement
    
    fraud_indices = df[df[category] == 1].index #get indices of fraud
    under_sample_indices = np.concatenate([fraud_indices, random_indices])
    return df.loc[under_sample_indices]

data = undersample(partB_train_df, 'fraudulent')

#print
a = data.fraudulent.value_counts()
display(a)
print('Fraudulent physicians are {0}% of all data.'.format(str(np.round((a[1]/(a[0]+a[1]))*100,decimals = 2)))) 
1    8867
0    8867
Name: fraudulent, dtype: int64
Fraudulent physicians are 50.0% of all data.
In [51]:
# # We are NOT normalizing the data
# from sklearn import preprocessing

# def scale_predictors(df):
#     '''Takes in df and returns normalized data [0,1] '''
#     x  = df.values #numpy array
#     min_max_scaler = preprocessing.MinMaxScaler()
#     x_scaled = min_max_scaler.fit_transform(x)
#     return pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
In [52]:
# One-hot encoding
def one_hot_encode(df, cols):
    '''Encodes categorical variables in df with cols as an array'''
    for col in cols:
        df = pd.concat([df, pd.get_dummies(df[col], drop_first= True)], axis=1)
        return df

cols = ['provider_type', 'nppes_provider_state', 'hcpcs_code', 'nppes_provider_gender']
data = one_hot_encode(data, cols)
data = data.drop(columns=cols, axis=1).reset_index(drop=True) #drop column that's been encoded
In [53]:
display(data.shape)
print('There are {0} predictors after turning categories into dummy variables.'.format(data.shape[1]-2)) 
(17734, 109)
There are 107 predictors after turning categories into dummy variables.

6. Model:

In [54]:
data.head()
Out[54]:
npi line_srvc_cnt bene_unique_cnt bene_day_srvc_cnt average_submitted_chrg_amt average_medicare_payment_amt fraudulent All Other Suppliers Allergy/ Immunology Allergy/Immunology ... Registered Dietitian or Nutrition Professional Rheumatology Sleep Medicine Speech Language Pathologist Sports Medicine Surgical Oncology Thoracic Surgery Unknown Physician Specialty Code Urology Vascular Surgery
0 1003870239 14.0 13.0 14.0 265.428571 62.630000 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1003870239 610.0 91.0 610.0 648.000000 236.548836 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1003870239 217.0 91.0 217.0 540.000000 200.677419 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1003870239 24.0 19.0 24.0 430.000000 154.030000 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1003870239 11.0 11.0 11.0 220.000000 140.120000 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 109 columns

In [55]:
data.fraudulent.value_counts()
Out[55]:
1    8867
0    8867
Name: fraudulent, dtype: int64
In [56]:
import statsmodels.api as sm

cols = [col for col in data.columns if col not in ['npi', 'fraudulent']]
X = data[cols]
y = data['fraudulent']

logit_model=sm.Logit(y,X)
result=logit_model.fit(method='bfgs')
print(result.summary2())
/usr/local/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py:1747: RuntimeWarning:

overflow encountered in exp

/usr/local/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py:1747: RuntimeWarning:

overflow encountered in exp

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.569770
         Iterations: 35
         Function evaluations: 56
         Gradient evaluations: 47
                                        Results: Logit
===============================================================================================
Model:                        Logit                      Pseudo R-squared:           0.178     
Dependent Variable:           fraudulent                 AIC:                        20422.6174
Date:                         2020-02-23 15:57           BIC:                        21255.4240
No. Observations:             17734                      Log-Likelihood:             -10104.   
Df Model:                     106                        LL-Null:                    -12292.   
Df Residuals:                 17627                      LLR p-value:                0.0000    
Converged:                    0.0000                     Scale:                      1.0000    
-----------------------------------------------------------------------------------------------
                                                Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
-----------------------------------------------------------------------------------------------
line_srvc_cnt                                   0.0000   0.0000   0.5065 0.6125 -0.0000  0.0000
bene_unique_cnt                                -0.0015   0.0002  -9.1021 0.0000 -0.0018 -0.0011
bene_day_srvc_cnt                               0.0009   0.0001   9.5333 0.0000  0.0007  0.0011
average_submitted_chrg_amt                     -0.0001   0.0000  -2.3152 0.0206 -0.0001 -0.0000
average_medicare_payment_amt                   -0.0002   0.0002  -0.8825 0.3775 -0.0006  0.0002
All Other Suppliers                            -0.0240   1.4158  -0.0170 0.9865 -2.7990  2.7510
Allergy/ Immunology                            -0.0454   0.4727  -0.0961 0.9234 -0.9719  0.8811
Allergy/Immunology                              0.0881   0.2759   0.3193 0.7495 -0.4527  0.6288
Ambulance Service Provider                     -0.0098   0.4310  -0.0227 0.9819 -0.8545  0.8349
Ambulance Service Supplier                      0.1326   0.2846   0.4658 0.6414 -0.4253  0.6905
Ambulatory Surgical Center                     -0.5091   0.2964  -1.7175 0.0859 -1.0900  0.0719
Anesthesiologist Assistants                    -0.0687   0.8191  -0.0839 0.9332 -1.6740  1.5367
Anesthesiology                                  0.7822   0.0904   8.6567 0.0000  0.6051  0.9594
Anesthesiology Assistant                       -0.0116   2.0022  -0.0058 0.9954 -3.9358  3.9126
Audiologist                                     0.1695   0.4273   0.3966 0.6916 -0.6680  1.0070
Audiologist (billing independently)             0.2541   0.2967   0.8564 0.3918 -0.3274  0.8356
CRNA                                           -0.1639   0.1744  -0.9397 0.3473 -0.5057  0.1779
Cardiac Electrophysiology                       0.0956   0.2566   0.3724 0.7096 -0.4074  0.5985
Cardiac Surgery                                -0.2214   0.4498  -0.4923 0.6225 -1.1029  0.6601
Cardiology                                      0.2608   0.0674   3.8696 0.0001  0.1287  0.3929
Cardiovascular Disease (Cardiology)             0.1701   0.1276   1.3327 0.1826 -0.0801  0.4202
Centralized Flu                                -0.1643   0.5396  -0.3045 0.7608 -1.2219  0.8933
Certified Clinical Nurse Specialist            -0.0491   0.7078  -0.0694 0.9446 -1.4365  1.3382
Certified Nurse Midwife                        -0.0243   1.4146  -0.0172 0.9863 -2.7968  2.7482
Certified Registered Nurse Anesthetist (CRNA)  -0.1965   0.2886  -0.6811 0.4958 -0.7621  0.3690
Chiropractic                                   -0.2173   0.2188  -0.9931 0.3206 -0.6460  0.2115
Clinical Cardiatric Electrophysiology          -0.1312   0.5849  -0.2244 0.8225 -1.2776  1.0151
Clinical Laboratory                             1.9693   0.1272  15.4807 0.0000  1.7200  2.2186
Clinical Psychologist                           0.3215   0.2128   1.5106 0.1309 -0.0956  0.7387
Colorectal Surgery (Proctology)                -0.0697   0.8186  -0.0852 0.9321 -1.6742  1.5347
Colorectal Surgery (formerly proctology)       -0.1478   0.5589  -0.2644 0.7915 -1.2432  0.9476
Critical Care (Intensivists)                   -0.1635   0.5378  -0.3040 0.7611 -1.2176  0.8906
Dermatology                                    -0.5519   0.1218  -4.5331 0.0000 -0.7906 -0.3133
Diagnostic Radiology                           -5.5707   0.4806 -11.5909 0.0000 -6.5127 -4.6287
Emergency Medicine                             -0.3811   0.1075  -3.5457 0.0004 -0.5918 -0.1704
Endocrinology                                  -0.2113   0.2093  -1.0092 0.3129 -0.6215  0.1990
Family Practice                                 0.0275   0.0447   0.6160 0.5379 -0.0600  0.1150
Gastroenterology                               -1.2662   0.1790  -7.0730 0.0000 -1.6171 -0.9153
General Practice                                1.2963   0.1480   8.7598 0.0000  1.0063  1.5863
General Surgery                                -0.1766   0.1283  -1.3770 0.1685 -0.4280  0.0748
Geriatric Medicine                              2.1642   0.1857  11.6513 0.0000  1.8002  2.5283
Geriatric Psychiatry                           -0.0241   1.4147  -0.0170 0.9864 -2.7967  2.7486
Gynecological Oncology                         -0.0123   2.0000  -0.0062 0.9951 -3.9323  3.9076
Gynecological/Oncology                         -0.0804   0.7594  -0.1059 0.9157 -1.5689  1.4080
Hand Surgery                                   -0.1634   0.5380  -0.3038 0.7613 -1.2178  0.8910
Hematology                                     -0.1091   0.6679  -0.1633 0.8703 -1.4181  1.2000
Hematology-Oncology                            -0.2604   0.2797  -0.9308 0.3519 -0.8086  0.2879
Hematology/Oncology                            -0.3793   0.1675  -2.2638 0.0236 -0.7076 -0.0509
Hospice and Palliative Care                    -0.0350   1.1571  -0.0302 0.9759 -2.3029  2.2330
Independent Diagnostic Testing Facility         1.0246   0.1837   5.5777 0.0000  0.6646  1.3847
Independent Diagnostic Testing Facility (IDTF)  0.5247   0.2932   1.7897 0.0735 -0.0499  1.0994
Infectious Disease                             -0.5233   0.3059  -1.7105 0.0872 -1.1229  0.0763
Internal Medicine                               0.4680   0.0408  11.4662 0.0000  0.3880  0.5480
Interventional Cardiology                      -0.4336   0.3318  -1.3071 0.1912 -1.0839  0.2166
Interventional Pain Management                  0.8873   0.1750   5.0696 0.0000  0.5442  1.2303
Interventional Radiology                       -0.3678   0.3579  -1.0276 0.3041 -1.0692  0.3337
Licensed Clinical Social Worker                 0.1463   0.2704   0.5413 0.5883 -0.3836  0.6763
Mass Immunization Roster Biller                -0.2669   0.4232  -0.6308 0.5282 -1.0963  0.5625
Mass Immunizer Roster Biller                   -0.0835   0.7581  -0.1102 0.9123 -1.5693  1.4023
Maxillofacial Surgery                          -0.0117   2.0020  -0.0058 0.9953 -3.9355  3.9121
Medical Oncology                               -0.4208   0.3453  -1.2187 0.2230 -1.0976  0.2560
Multispecialty Clinic/Group Practice           -0.0131   2.0024  -0.0065 0.9948 -3.9377  3.9116
Nephrology                                     -0.9404   0.2381  -3.9495 0.0001 -1.4070 -0.4737
Neurology                                       1.2510   0.1065  11.7435 0.0000  1.0422  1.4598
Neurosurgery                                   -0.4807   0.3009  -1.5972 0.1102 -1.0705  0.1092
Nuclear Medicine                               -0.0947   0.7090  -0.1335 0.8938 -1.4843  1.2950
Nurse Practitioner                             -0.4686   0.0906  -5.1732 0.0000 -0.6461 -0.2910
Obstetrics & Gynecology                        -0.1224   0.3217  -0.3804 0.7036 -0.7528  0.5081
Obstetrics/Gynecology                          -0.5351   0.1953  -2.7396 0.0062 -0.9180 -0.1523
Occupational Therapist in Private Practice     -0.0539   1.0024  -0.0538 0.9571 -2.0185  1.9107
Occupational therapist                         -0.1940   0.5561  -0.3489 0.7272 -1.2840  0.8959
Ophthalmology                                  -1.6273   0.1674  -9.7222 0.0000 -1.9554 -1.2992
Optometry                                      -0.4388   0.1328  -3.3031 0.0010 -0.6992 -0.1784
Oral Surgery (dentists only)                   -0.0121   2.0004  -0.0061 0.9952 -3.9328  3.9085
Orthopedic Surgery                             -0.2621   0.0974  -2.6909 0.0071 -0.4531 -0.0712
Osteopathic Manipulative Medicine              -0.0195   1.4425  -0.0135 0.9892 -2.8467  2.8078
Otolaryngology                                  0.2330   0.1429   1.6304 0.1030 -0.0471  0.5131
Pain Management                                 0.7519   0.1847   4.0712 0.0000  0.3899  1.1139
Pathology                                      -1.4629   0.2118  -6.9066 0.0000 -1.8781 -1.0478
Pediatric Medicine                              0.3799   0.2827   1.3441 0.1789 -0.1741  0.9340
Peripheral Vascular Disease                    -0.0241   1.4148  -0.0170 0.9864 -2.7971  2.7490
Physical Medicine and Rehabilitation            0.4205   0.1421   2.9588 0.0031  0.1420  0.6991
Physical Therapist                             -0.9009   0.1507  -5.9769 0.0000 -1.1963 -0.6055
Physical Therapist in Private Practice         -0.5349   0.2408  -2.2210 0.0264 -1.0069 -0.0629
Physician Assistant                            -0.9488   0.1145  -8.2874 0.0000 -1.1731 -0.7244
Plastic and Reconstructive Surgery             -0.3484   0.3675  -0.9480 0.3431 -1.0688  0.3719
Podiatry                                        2.0363   0.1062  19.1660 0.0000  1.8281  2.2446
Portable X-Ray Supplier                        -0.0252   1.4145  -0.0178 0.9858 -2.7976  2.7473
Portable X-ray                                 -0.1681   0.5395  -0.3117 0.7553 -1.2255  0.8893
Preventive Medicine                            -0.0268   1.4218  -0.0188 0.9850 -2.8135  2.7600
Psychiatry                                      0.2989   0.1382   2.1639 0.0305  0.0282  0.5697
Psychologist (billing independently)            0.1678   0.5188   0.3234 0.7464 -0.8490  1.1846
Psychologist, Clinical                          0.1683   0.3808   0.4419 0.6586 -0.5780  0.9145
Pulmonary Disease                              -1.2821   0.2151  -5.9612 0.0000 -1.7037 -0.8606
Radiation Oncology                             -0.8653   0.2437  -3.5500 0.0004 -1.3430 -0.3876
Radiation Therapy                              -0.0117   2.0045  -0.0058 0.9953 -3.9405  3.9171
Registered Dietician/Nutrition Professional    -0.0370   1.1549  -0.0321 0.9744 -2.3006  2.2265
Registered Dietitian or Nutrition Professional -0.0124   2.0004  -0.0062 0.9950 -3.9331  3.9082
Rheumatology                                   -0.4333   0.2356  -1.8394 0.0659 -0.8950  0.0284
Sleep Medicine                                 -0.0240   1.4152  -0.0170 0.9865 -2.7978  2.7498
Speech Language Pathologist                     0.0418   1.0252   0.0408 0.9675 -1.9677  2.0512
Sports Medicine                                -0.1428   0.5799  -0.2462 0.8055 -1.2793  0.9937
Surgical Oncology                              -0.0399   1.0171  -0.0392 0.9687 -2.0334  1.9536
Thoracic Surgery                               -0.1474   0.5550  -0.2656 0.7905 -1.2352  0.9403
Unknown Physician Specialty Code               -0.0237   1.4160  -0.0167 0.9867 -2.7991  2.7517
Urology                                        -1.2434   0.1619  -7.6800 0.0000 -1.5608 -0.9261
Vascular Surgery                               -0.1962   0.1998  -0.9822 0.3260 -0.5878  0.1954
===============================================================================================

/usr/local/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

Future Steps

1) Split train vs. test set - understand how well a model performs and its evaluation metrics (precision, recall, AUC). Variable transformation, for example, is a step towards achieving a better model.

2) Here, we only tried a 50:50 class distribution for fraud vs. non fraud. It will be worth to experiment with different class splits and compare results.

3) Try other classification models that previous literature has found to be productive (XGBoost, SVM).

References